Continuing from Day 4 (GPU Operator ready). This part connects NFS to Kubernetes to provide storage for Isaac Sim/Isaac Lab datasets and assets.
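NFS server node: nfs1
Export directory: /srv/nfs/isaac
  assets/: 3D scenes, USD, URDF, textures, models, etc.
  datasets/: recorded data, training data, experiment output
  cache/: temporary cache (safe to clear)
K8s namespace: robotics
Later, the Selkies/Isaac Deployment will mount assets/ and datasets/, with access permissions and capacity limits under control.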
Ubuntu
sudo apt-get update
sudo apt-get install -y nfs-kernel-server
sudo mkdir -p /srv/nfs/isaac/{assets,datasets,cache}
sudo chown -R root:root /srv/nfs/isaac
sudo chmod 0775 /srv/nfs/isaac /srv/nfs/isaac/*
Rocky/Alma 9
sudo dnf install -y nfs-utils
sudo mkdir -p /srv/nfs/isaac/{assets,datasets,cache}
sudo chown -R root:root /srv/nfs/isaac
sudo chmod 0775 /srv/nfs/isaac /srv/nfs/isaac/*
/etc/exports
/srv/nfs/isaac 192.168.27.0/24(rw,sync,no_subtree_check,root_squash)
Apply:
sudo exportfs -ra
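Before reloading, it can help to confirm that the entry in /etc/exports really carries the options you expect. The sketch below is a hypothetical helper (check_export is not part of nfs-utils); it only parses the file and assumes the simple one-line path host(options) layout used above.

```shell
# Hypothetical helper: succeed if <exports-file> has an entry for <path>
# whose option list contains <option> (exact match, so "root_squash"
# does not match "no_root_squash").
# Usage: check_export <exports-file> <path> <option>
check_export() {
  awk -v p="$2" -v o="$3" '
    $1 == p {
      for (i = 2; i <= NF; i++) {
        # Each client field looks like host(opt1,opt2,...)
        s = $i
        sub(/^[^(]*\(/, "", s)   # drop "host("
        sub(/\).*$/, "", s)      # drop trailing ")"
        n = split(s, opts, ",")
        for (j = 1; j <= n; j++) if (opts[j] == o) found = 1
      }
    }
    END { exit(found ? 0 : 1) }
  ' "$1"
}
```

On nfs1 this might be used as `check_export /etc/exports /srv/nfs/isaac root_squash && sudo exportfs -v`, where exportfs -v lists the active exports with their options.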
Main NFS ports: 2049/tcp,udp; if rpcbind is used: 111/tcp,udp. NFSv4 itself needs only 2049/tcp; rpcbind matters for NFSv3 clients.
Ubuntu (UFW)
sudo ufw allow 2049
sudo ufw allow 111
Rocky (firewalld)
sudo systemctl enable --now nfs-server
sudo firewall-cmd --permanent --add-service=nfs
sudo firewall-cmd --permanent --add-service=rpc-bind
sudo firewall-cmd --reload
Test the mount on worker gpu01:
sudo apt-get install -y nfs-common   # Ubuntu
# or: sudo dnf install -y nfs-utils  # Rocky
sudo mount -t nfs nfs1:/srv/nfs/isaac /mnt
ls /mnt
sudo umount /mnt
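Listing the mount only proves read access. A minimal sketch of a write check follows, to be run while the export is still mounted (that is, before the umount step). The function name nfs_write_probe is an assumption for illustration; it simply tries to create and remove a scratch file as the current user.

```shell
# Hypothetical helper: report whether the current user can create a file
# in the given directory. Prints "writable" or "read-only-or-denied".
nfs_write_probe() {
  local dir="$1" probe
  probe="$dir/.write-probe.$$"
  # Try to create an empty file; errors are silenced and handled below.
  if ( : > "$probe" ) 2>/dev/null; then
    rm -f "$probe"
    echo "writable"
  else
    echo "read-only-or-denied"
    return 1
  fi
}
```

For example, `nfs_write_probe /mnt/assets` as a normal user and `sudo sh` variants of the same check show how root and non-root clients are treated differently by the export.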
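By default, root_squash keeps root inside a container from acting as root on the NFS export; if a workload needs write access, run the container with a UID/GID that matches the ownership on the server.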
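To make the squash behaviour concrete, here is a deliberately simplified model of how the server maps a client UID. It assumes the default anonymous UID of 65534 described in exports(5) and ignores combinations of options; squashed_uid is a made-up name, not an NFS tool.

```shell
# Simplified model of NFS UID squashing (illustration only).
# Usage: squashed_uid <client-uid> [root_squash|all_squash|no_root_squash] [anonuid]
squashed_uid() {
  local uid="$1" mode="${2:-root_squash}" anon="${3:-65534}"
  case "$mode" in
    # Only UID 0 is remapped; every other UID passes through unchanged.
    root_squash)    if [ "$uid" -eq 0 ]; then echo "$anon"; else echo "$uid"; fi ;;
    # Every client UID is remapped to the anonymous UID.
    all_squash)     echo "$anon" ;;
    # Nothing is remapped; client root stays root on the server.
    no_root_squash) echo "$uid" ;;
    *) echo "unknown mode: $mode" >&2; return 2 ;;
  esac
}
```

Under this model a container running as root writes as 65534, while one running as UID 2000 writes as 2000, which is why a fixed non-root UID that owns the exported directories is the simpler route to write access.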
# nfs-pv-static.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-isaac-assets
spec:
  capacity:
    storage: 200Gi
  accessModes: [ReadWriteMany]
  persistentVolumeReclaimPolicy: Retain
  nfs:
    path: /srv/nfs/isaac/assets
    server: nfs1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-isaac-assets
  namespace: robotics
spec:
  accessModes: [ReadWriteMany]
  # Empty class: bind to the static PV even if the cluster has a default StorageClass
  storageClassName: ""
  resources:
    requests:
      storage: 200Gi
  volumeName: pv-isaac-assets
Apply:
kubectl create ns robotics || true
kubectl apply -f nfs-pv-static.yaml
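# test-nfs-pod.yaml
# Note: busybox runs as root, which root_squash maps to an unprivileged user;
# the write succeeds only if the exported directory is writable by that user,
# or if the Pod runs as the directory owner (see the UID/GID notes further down).
apiVersion: v1
kind: Pod
metadata:
  name: nfs-rwx-check
  namespace: robotics
spec:
  restartPolicy: Never
  containers:
    - name: busy
      image: busybox:1.37
      command: ["sh","-lc","id; mkdir -p /mnt/test && echo ok > /mnt/test/hello && ls -l /mnt/test && sleep 5"]
      volumeMounts:
        - name: v
          mountPath: /mnt
  volumes:
    - name: v
      persistentVolumeClaim:
        claimName: pvc-isaac-assets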
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm repo update
helm upgrade --install nfs-prov \
nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
-n storage --create-namespace \
--set nfs.server=nfs1 \
--set nfs.path=/srv/nfs/isaac \
--set storageClass.name=nfs-rwx \
--set storageClass.defaultClass=false \
--set storageClass.accessModes={ReadWriteMany}
This creates a StorageClass named nfs-rwx. From then on, any PVC that references this StorageClass automatically gets its own subdirectory under /srv/nfs/isaac/.
# pvc-datasets.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-datasets
  namespace: robotics
spec:
  accessModes: [ReadWriteMany]
  storageClassName: nfs-rwx
  resources:
    requests:
      storage: 500Gi
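Once the PVC is created, the corresponding subdirectory appears under nfs1:/srv/nfs/isaac/. By default it is named <namespace>-<pvcName>-<pvName>, where the PV name has the form pvc-<PVC UID>.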
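As a small illustration of where to look on the server, the sketch below builds the expected directory name. It assumes the provisioner's default naming of namespace, PVC name, and PV name joined by hyphens, with no custom pathPattern set on the StorageClass; provisioned_dir_name is a hypothetical helper and the UID in the example is made up.

```shell
# Hypothetical helper: expected subdirectory name under the NFS export,
# assuming the default <namespace>-<pvcName>-<pvName> layout.
# Usage: provisioned_dir_name <namespace> <pvc-name> <pv-name>
provisioned_dir_name() {
  printf '%s-%s-%s\n' "$1" "$2" "$3"
}

# Example with a made-up PV name:
# provisioned_dir_name robotics pvc-datasets pvc-1234abcd
```

The real PV name can be read from the bound claim with `kubectl get pvc pvc-datasets -n robotics -o jsonpath='{.spec.volumeName}'` and passed in as the third argument.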
# isaac-mounts-example.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: isaac-sandbox-demo
  namespace: robotics
spec:
  replicas: 1
  selector:
    matchLabels: {app: isaac-sandbox-demo}
  template:
    metadata:
      labels: {app: isaac-sandbox-demo}
    spec:
      containers:
        - name: app
          image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
          command: ["bash","-lc","ls -al /assets; ls -al /datasets; sleep 3600"]
          volumeMounts:
            - name: assets
              mountPath: /assets
            - name: datasets
              mountPath: /datasets
      volumes:
        - name: assets
          persistentVolumeClaim:
            claimName: pvc-isaac-assets
        - name: datasets
          persistentVolumeClaim:
            claimName: pvc-datasets
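root_squash: by default the NFS server maps the client's root user to an unprivileged user. On the container side, therefore, use a fixed UID/GID wherever possible, and create the matching directories and ownership on the NFS server.
Example (running all robotics workloads as UID/GID 2000):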
sudo groupadd -g 2000 robotics || true
sudo useradd -u 2000 -g 2000 -M -s /usr/sbin/nologin robotics || true
sudo chown -R 2000:2000 /srv/nfs/isaac/{assets,datasets,cache}
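After changing ownership, a quick check on nfs1 can confirm that each exported directory is owned by the UID/GID the workloads will use. This is a minimal sketch assuming GNU stat (as shipped on Ubuntu and Rocky); check_owner is a hypothetical helper name.

```shell
# Hypothetical helper: verify that each directory is owned by <uid>:<gid>.
# Usage: check_owner <uid>:<gid> <dir>...
# Prints one line per directory and returns non-zero on any mismatch.
check_owner() {
  local expected="$1" rc=0 dir actual
  shift
  for dir in "$@"; do
    # GNU stat: %u = numeric owner UID, %g = numeric owner GID
    actual="$(stat -c '%u:%g' "$dir")" || { rc=1; continue; }
    if [ "$actual" = "$expected" ]; then
      echo "OK $dir"
    else
      echo "MISMATCH $dir: $actual (expected $expected)"
      rc=1
    fi
  done
  return "$rc"
}
```

For the layout above this could be called as `check_owner 2000:2000 /srv/nfs/isaac/assets /srv/nfs/isaac/datasets /srv/nfs/isaac/cache`.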
In a container, the user can be set directly with --user 2000:2000; in a Pod, set it in the security context:
securityContext:
  runAsUser: 2000
  runAsGroup: 2000
  fsGroup: 2000
It looks like I won't make the schedule, and just thinking about working overtime on the weekend hurts. Sob.